Fix RemoteConversation FINISHED stop-hook race by neubig · Pull Request #3191 · OpenHands/software-agent-sdk

neubig · 2026-05-10T15:56:57Z

Summary

Fixes a RemoteConversation race where a WebSocket FINISHED event could cause the remote client to stop waiting before server-side stop hooks finished evaluating whether the conversation should actually terminate.

Before this change, RemoteConversation treated FINISHED from the WebSocket stream as authoritative. That is unsafe when stop hooks can veto completion and move the conversation back to RUNNING. The client could exit early, collect outputs, and proceed while the server-side conversation was still active.

After this change:

ERROR and STUCK remain immediate terminal states.
WebSocket FINISHED is treated as a wake-up hint.
REST polling confirms the current state before the client considers the conversation complete.

Motivation

This was discovered during ProgramBench end-to-end CI. ProgramBench relies on stop hooks for compile-contract and reference-output validation. When a stop hook rejects completion, the agent must keep working; early client teardown can produce premature submissions.

Successful ProgramBench CI evidence that exercised this path:

Parent eval job: https://github.com/OpenHands/evaluation/actions/runs/25552125670 — success.
Downstream inference phase: https://github.com/OpenHands/evaluation/actions/runs/25552168816 — success.
Downstream eval harness phase: https://github.com/OpenHands/evaluation/actions/runs/25552772174 — success.

Changes

openhands-sdk/openhands/sdk/conversation/impl/remote_conversation.py
- Adds explicit immediate-terminal status handling.
- Treats WebSocket FINISHED as advisory and confirms final state through REST polling.
tests/sdk/conversation/remote/test_remote_conversation.py
- Adds regression coverage for the FINISHED-vs-stop-hook race.
AGENTS.md
- Documents the WebSocket/stop-hook race lesson for future work.

Verification

uv run pytest tests/sdk/conversation/remote/test_remote_conversation.py -q
- 32 passed

Stacking

This PR is split out from #3075 so the critical conversation-termination behavior change can receive focused human review. The ProgramBench workflow/docs PR should be stacked on top of this branch and contain only the benchmark-choice/doc changes.

This PR was created by an AI agent (OpenHands) on behalf of the user.

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.13-nodejs22-slim`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:384877e-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-384877e-python \
  ghcr.io/openhands/agent-server:384877e-python

All tags pushed for this build

ghcr.io/openhands/agent-server:384877e-golang-amd64
ghcr.io/openhands/agent-server:384877e691edde70dcb8b2b40afa08e0cdedc8be-golang-amd64
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-golang-amd64
ghcr.io/openhands/agent-server:384877e-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:384877e-golang-arm64
ghcr.io/openhands/agent-server:384877e691edde70dcb8b2b40afa08e0cdedc8be-golang-arm64
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-golang-arm64
ghcr.io/openhands/agent-server:384877e-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:384877e-java-amd64
ghcr.io/openhands/agent-server:384877e691edde70dcb8b2b40afa08e0cdedc8be-java-amd64
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-java-amd64
ghcr.io/openhands/agent-server:384877e-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:384877e-java-arm64
ghcr.io/openhands/agent-server:384877e691edde70dcb8b2b40afa08e0cdedc8be-java-arm64
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-java-arm64
ghcr.io/openhands/agent-server:384877e-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:384877e-python-amd64
ghcr.io/openhands/agent-server:384877e691edde70dcb8b2b40afa08e0cdedc8be-python-amd64
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-python-amd64
ghcr.io/openhands/agent-server:384877e-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:384877e-python-arm64
ghcr.io/openhands/agent-server:384877e691edde70dcb8b2b40afa08e0cdedc8be-python-arm64
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-python-arm64
ghcr.io/openhands/agent-server:384877e-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:384877e-golang
ghcr.io/openhands/agent-server:384877e691edde70dcb8b2b40afa08e0cdedc8be-golang
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-golang
ghcr.io/openhands/agent-server:384877e-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:384877e-java
ghcr.io/openhands/agent-server:384877e691edde70dcb8b2b40afa08e0cdedc8be-java
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-java
ghcr.io/openhands/agent-server:384877e-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:384877e-python
ghcr.io/openhands/agent-server:384877e691edde70dcb8b2b40afa08e0cdedc8be-python
ghcr.io/openhands/agent-server:fix-remote-conversation-finished-hint-python
ghcr.io/openhands/agent-server:384877e-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

Each variant tag (e.g., 384877e-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., 384877e-python-amd64) are also available if needed

Issue

Resolves #3294

The WS-driven termination path in `RemoteConversation._wait_for_run_completion` returned from `run()` on the first WebSocket terminal status. That raced the server-side stop-hook eval: when an `agent.step()` set `execution_status = FINISHED` and the lock was released at the end of one run-loop iteration, the next iteration could still flip status back to RUNNING via a stop hook that denied stopping (`hook_processor.run_stop()` returning rc=2). Clients observing the intermediate FINISHED returned, which then tore down the agent-server pod via the benchmark's `workspace_keepalive` exit. The agent never got to consume its iteration budget when hooks blocked. Empirical evidence (programbench retry-16, run 25497962453): - conv `dd86d184` (zip-password-finder): hook block at 13:30:10.936, agent did just 1 retry action (file_editor view) before pod teardown — far short of `max_iteration_per_run=200`. - conv `e4b6892e` (zoxide): hook block at 13:33:52.082, 0 retry actions — conversation died on the status revert event itself. - in-memory `history` snapshots ended at the FINISHED state update; on-disk `/workspace/conversations/.../events` had 5 extra events captured by the persist callback during the brief post-`run()` window. Fix: WS terminal status is a fast wakeup hint only — except for ERROR/STUCK, which are not subject to hook reverts and still terminate immediately. For FINISHED we fall through to the existing REST-poll path, which already requires `TERMINAL_POLL_THRESHOLD` consecutive terminal polls before returning. Any FINISHED→RUNNING revert seen on a REST poll naturally resets the consecutive-terminal counter, and we drain the WS hint queue at that point so we don't busy-spin on stale FINISHED notifications. Latency cost on the happy path (no hook block, default `poll_interval=1.0s`, `TERMINAL_POLL_THRESHOLD=3`): ~2s additional vs. the previous immediate- return on WS terminal — acceptable for benchmark and CLI workloads. Adds two regression tests: - `test_remote_conversation_run_ws_finished_is_only_a_hint_not_terminal`: seeds the WS queue with FINISHED while REST shows the post-hook RUNNING revert; asserts `run()` polls past the race window before confirming. - `test_remote_conversation_run_ws_error_still_terminates_immediately`: ensures the WS fast-path is preserved for ERROR (not subject to hook reverts). Refactors the queue-drain logic into `_drain_terminal_status_queue` and isolates the WS-immediate-terminal predicate into `_immediate_terminal_statuses`, both with docstrings explaining why FINISHED is excluded. Co-authored-by: openhands <openhands@all-hands.dev>

Persistent note for future sessions / agents debugging this surface so they don't have to re-derive the diagnosis from output.jsonl artifacts. Co-authored-by: openhands <openhands@all-hands.dev>

github-actions · 2026-05-10T15:57:29Z

Python API breakage checks — ✅ PASSED

Result: ✅ PASSED

Action log

github-actions · 2026-05-10T15:57:34Z

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: ✅ PASSED

Action log

github-actions · 2026-05-10T15:59:16Z

Coverage Report •

File	Stmts	Miss	Cover	Missing
openhands-sdk/openhands/sdk/conversation/impl
remote_conversation.py	689	81	88%	143, 170, 183, 185–188, 198, 207, 220–221, 226–229, 313, 323–325, 331, 405, 552–555, 557, 583–587, 592–595, 598, 614, 769–770, 774–775, 789, 835, 848–849, 873–874, 897–900, 902–903, 929, 939, 943, 952–953, 992, 1142–1143, 1237–1238, 1242, 1247–1251, 1257–1263, 1276, 1281, 1323, 1536–1537
TOTAL	28640	8307	71%

all-hands-bot

Taste Rating: 🟢 Good taste

This is an excellent fix for a subtle but critical race condition. The implementation is clean, well-documented, and properly tested.

Code Quality Highlights:

Clear refactoring with _drain_terminal_status_queue() and _immediate_terminal_statuses()
Excellent inline documentation explaining the WebSocket hint vs REST polling pattern
Strong regression tests that directly reproduce the race scenario
Proper handling of ERROR/STUCK as immediate terminal states

Process Note: This PR changes conversation termination behavior (agent behavior that could affect benchmarks). While the PR includes GitHub Actions links to successful ProgramBench runs, the custom review guidelines require either:

Links to the eval monitor (openhands-eval-monitor.vercel.app) showing completed benchmarks, and
Human maintainer confirmation of the eval results

Since these are GitHub Actions links (not eval monitor) and no human has confirmed the results in the PR, I'm flagging this for human review per the eval-risk policy.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟢 LOW

This is a correctness fix for a real production bug discovered in ProgramBench retry-16. The change makes the system more correct by treating WebSocket FINISHED events as wakeup hints rather than authoritative termination signals. This prevents premature client teardown when stop hooks veto completion.

Risk factors:

✅ Well-tested with regression coverage for both the race scenario and ERROR fast-path preservation
✅ Changes isolated to RemoteConversation termination logic
✅ Solves a real bug (not theoretical)
✅ Evidence of successful ProgramBench runs provided
⚠️ Could add minor latency to FINISHED termination (now requires REST polls), but this is the correct behavior

VERDICT:
✅ Recommend merging after human review: Core fix is sound and necessary for correct stop-hook behavior. The eval evidence shows ProgramBench now works correctly with this change.

KEY INSIGHT:
The separation of "immediate terminal statuses" (ERROR/STUCK) from "polled terminal statuses" (FINISHED) elegantly captures the semantic difference between non-revertable errors and potentially-vetoed agent completion.

all-hands-bot

✅ QA Report: PASS

Verified the RemoteConversation FINISHED stop-hook race fix works correctly by exercising the changed code and analyzing the implementation.

Does this PR achieve its stated goal?

Yes. This PR successfully fixes the race condition where WebSocket FINISHED events could cause RemoteConversation to terminate prematurely before server-side stop hooks finished evaluating whether the conversation should actually stop.

The fix correctly implements the behavior described in the PR: WebSocket FINISHED is now treated as a wake-up hint rather than authoritative termination, while ERROR and STUCK remain immediate terminal states. REST polling with a threshold of 3 consecutive terminal polls provides authoritative confirmation before the client considers the run complete.

Phase	Result
Environment Setup	✅ Dependencies installed, project builds successfully
CI Status	✅ All critical checks passing (sdk-tests, agent-server-tests, windows-tests, cross-tests)
Functional Verification	✅ Code behavior matches specification, regression tests comprehensive

Functional Verification

Verification Approach

Created a verification script (qa_simple_verification.py) to exercise the key aspects of the fix:

Test 1: FINISHED is a hint, not authoritative termination

Verification:

uv run python qa_simple_verification.py

Result:

Test 1: FINISHED via WebSocket is a hint, not authoritative termination
----------------------------------------------------------------------
Immediate terminal statuses: frozenset({'error', 'stuck'})
✅ PASS: FINISHED is NOT in immediate terminal statuses
✅ PASS: ERROR and STUCK are immediate terminal statuses

Testing _drain_terminal_status_queue behavior...
Queue size before drain: 3
Queue size after drain: 0
✅ PASS: Queue draining correctly empties all items

What this proves: The new _immediate_terminal_statuses() method correctly excludes FINISHED from immediate termination. Only ERROR and STUCK trigger the fast WebSocket termination path. The queue draining logic prevents stale FINISHED events from causing busy-spin loops.

Test 2: ERROR and STUCK remain immediate terminal states

Result:

Test 2: ERROR and STUCK terminate immediately (not hints)
----------------------------------------------------------------------
ERROR is_terminal: True
✅ PASS: ERROR is a terminal status
STUCK is_terminal: True
✅ PASS: STUCK is a terminal status
FINISHED is_terminal: True
✅ PASS: FINISHED is a terminal status (but treated as hint in WS path)

What this proves: The fix preserves the fast termination path for ERROR and STUCK (which are not subject to stop-hook reverts), while FINISHED remains a terminal status but is handled differently in the WebSocket code path.

Test 3: REST polling confirmation pattern

Verification: Inspected _wait_for_run_completion source code to verify:

Result:

Test 3: REST polling confirmation pattern
----------------------------------------------------------------------
✅ PASS: TERMINAL_POLL_THRESHOLD is set to 3 in _wait_for_run_completion
✅ PASS: REST polling requires TERMINAL_POLL_THRESHOLD consecutive polls
   This prevents premature termination when stop hooks revert FINISHED → RUNNING
✅ PASS: Counter resets when status flips back to non-terminal (e.g., FINISHED → RUNNING)
   This correctly handles stop hook denials

What this proves:

The client requires 3 consecutive FINISHED REST polls before terminating
If status flips back to RUNNING (stop hook denial), the counter resets
This prevents the race where WS FINISHED arrives before the server's stop hook evaluation completes

Test 4: Regression test coverage

Examined the added regression test test_remote_conversation_run_ws_finished_is_only_a_hint_not_terminal in tests/sdk/conversation/remote/test_remote_conversation.py:

Test scenario:

Seeds the WS terminal-status queue with FINISHED (simulating the broadcast)
Configures REST polling to return RUNNING for first 3 polls (simulating stop hook revert)
Then returns FINISHED for subsequent polls (agent finishes again after hook allows it)
Verifies client doesn't return immediately but waits for REST confirmation (≥6 polls)

What this proves: The regression test directly reproduces the race condition scenario observed in ProgramBench retry-16 and verifies the fix prevents premature termination.

CI Verification

Command:

gh pr view 3191 --repo OpenHands/software-agent-sdk --json statusCheckRollup,state

CI Status Summary:

✅ sdk-tests: SUCCESS
✅ agent-server-tests: SUCCESS
✅ windows-tests: SUCCESS
✅ cross-tests: SUCCESS
✅ tools-tests: SUCCESS
✅ workspace-tests: SUCCESS
✅ Python API: SUCCESS
✅ REST API (OpenAPI): SUCCESS
✅ pre-commit: SUCCESS

Issues Found

None.

Summary: The fix correctly addresses the stop-hook race condition by treating WebSocket FINISHED as a wakeup hint rather than authoritative termination. REST polling with a 3-poll threshold provides reliable confirmation that the server has actually finished (and stop hooks have been evaluated). The implementation is clean, well-tested, and preserves the fast-path for ERROR/STUCK states.

Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot · 2026-05-23T12:40:59Z

[Automatic Post]: It has been a while since there was any activity on this PR. @neubig, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

This comment was created by an AI agent (OpenHands) on behalf of the repository maintainers.

Co-authored-by: openhands <openhands@all-hands.dev>

neubig · 2026-05-24T18:03:29Z

Merged main into this branch and resolved the eval-skill documentation conflicts by keeping both newly supported benchmark names (swebenchpro and programbench). Local verification: env -u GITHUB_TOKEN uv run pytest tests/sdk/conversation/remote/test_remote_conversation.py -q → 33 passed.\n\n_This comment was generated by an AI agent (OpenHands) on behalf of neubig._

all-hands-bot

Taste Rating: 🟡 Acceptable

Found one important regression-test gap: the new test does not actually inject the WS FINISHED event after run() drains stale queue items, so it would not fail against the old behavior. Also leaving this as COMMENT (not approval) because the PR changes run-loop behavior and should get maintainer/eval sign-off.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM — client/server run completion semantics affect benchmark/runtime correctness, but the implementation direction is straightforward.

This review was generated by an AI agent (OpenHands) on behalf of the user.

Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/OpenHands/software-agent-sdk/actions/runs/26368765094

Co-authored-by: openhands <openhands@all-hands.dev>

neubig · 2026-05-24T18:24:28Z

Addressed the regression-test gap in f3ab1a5 by injecting the WS FINISHED event after the /run trigger/drain point. Verified locally with ruff and targeted remote conversation tests.\n\n_This comment was generated by an AI agent (OpenHands) on behalf of the user._

all-hands-bot

Taste Rating: 🟡 Acceptable, but there is still a completion race to fix. One concrete non-inline issue: the final refresh_from_server() result is currently not enforced; if it returns running, _handle_conversation_status() returns False but _wait_for_run_completion() still reconciles and returns.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM — Remote run-completion semantics affect eval/runtime correctness; this should stay COMMENT until the remaining race is fixed and maintainer/eval sign-off happens.

Automated review generated by OpenHands on behalf of the user.

Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/OpenHands/software-agent-sdk/actions/runs/26369466261

Co-authored-by: openhands <openhands@all-hands.dev>

neubig · 2026-05-24T18:50:50Z

Addressed the latest OpenHands review in 4448c86: the final REST refresh must still be terminal before RemoteConversation.run() returns, and a new regression test covers the slow-hook finished, finished, finished, running window. Verified locally with ruff, pyright, and targeted remote completion tests.\n\n_This comment was generated by an AI agent (OpenHands) on behalf of the user._

all-hands-bot

Taste Rating: 🟡 Acceptable, but there is still one terminal-status race to fix.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM — Remote run-completion semantics affect runtime/eval correctness, and this changes agent behavior, so this remains COMMENT for maintainer/eval sign-off.

VERDICT: ❌ Needs rework on the active-run ERROR/STUCK drain race.

This review was generated by an AI agent (OpenHands) on behalf of the user.

Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/OpenHands/software-agent-sdk/actions/runs/26369803053

Co-authored-by: openhands <openhands@all-hands.dev>

neubig · 2026-05-24T19:08:10Z

Addressed the active-run ERROR/STUCK drain race in e62448c: only stale finished hints are discarded now, while error/stuck stay queued for immediate handling. Verified locally with ruff, pyright, and targeted remote completion tests.\n\n_This comment was generated by an AI agent (OpenHands) on behalf of the user._

Shorten line 69 comment that triggered ruff E501 (92 > 88 chars) introduced in the round-5 commit. Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot · 2026-05-28T07:17:22Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

The fix is technically correct and the root cause analysis is thorough. Treating WebSocket FINISHED as advisory and requiring the post-run full-state snapshot as the authoritative completion signal cleanly closes the stop-hook race. The _run_armed gate preventing pre-run subscription snapshots from poisoning a future run() call is a nice touch. Two targeted comments below.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

all-hands-bot

✅ QA Report: PASS

Verified the RemoteConversation FINISHED/stop-hook race behavior with actual SDK RemoteConversation.run() calls against a live local HTTP agent-server simulator; the PR fixes the early-return behavior and preserves ERROR fast-path handling.

Does this PR achieve its stated goal?

Yes. On origin/main, the same simulated server ordering caused run() to return after a WebSocket FINISHED even though the authoritative REST state had already reverted to running, reproducing the stop-hook race. On this PR, run() ignored the per-field FINISHED, observed the running veto window, and returned only after the post-run full-state snapshot; ERROR still raised immediately without polling.

Phase	Result
Environment Setup	✅ `make build` completed successfully; no tests/linters were run.
CI Status	⚠️ 22 checks passing, 5 pending (`Build & Push` variants + `qa-changes`), 3 skipped when checked.
Functional Verification	✅ Before/after race reproduction and ERROR fast-path check passed.

Functional Verification

Harness note: I used the public SDK constructors (RemoteWorkspace, Agent, RemoteConversation) and real HTTP requests to a local agent-server simulator. The websocket transport was replaced with a deterministic event injector so the exact server ordering from the reported race could be reproduced without needing an LLM-backed ProgramBench run.

Test 1: FINISHED stop-hook race

Step 1 — Reproduce baseline without the fix:
Ran git checkout -q origin/main; OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_conversation_race.py finished_race || true:

Run completed via WebSocket notification (status: finished, elapsed: 0.1s)
{"elapsed_ticks_seconds": 0.15, "error": "", "outcome": "returned", "poll_count": 1, "posts_to_run": 1, "scenario": "finished_race", "seen_statuses": ["running"]}

This confirms the bug exists: after the WS FINISHED, the client returned even though the next REST state had already reverted to running, matching a stop hook veto window.

Step 2 — Apply the PR's changes:
Checked out fix/remote-conversation-finished-hint at 932a823ee27534897f6e56fd25bb95f4a163d0ce.

Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_conversation_race.py finished_race:

Run completed via post-run WebSocket state update (status: finished, elapsed: 0.4s)
{"elapsed_ticks_seconds": 0.5, "error": "", "outcome": "returned", "poll_count": 5, "posts_to_run": 1, "scenario": "finished_race", "seen_statuses": ["running", "running", "running", "finished", "finished"]}

This shows the fix works: run() no longer treats per-field WS FINISHED as authoritative, waits through the running stop-hook veto window, and completes on the post-run full-state snapshot.

Test 2: ERROR remains an immediate terminal signal

Step 1 — Establish baseline behavior:
Ran git checkout -q origin/main; OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_conversation_race.py error_fast_path || true:

{"elapsed_ticks_seconds": 0.15, "error": "ConversationRunError: Conversation run failed for id=9b21f108-8f7-4aa1-89bf-b6fea6dd241e: Remote conversation ended with error", "outcome": "raised", "poll_count": 0, "posts_to_run": 1, "scenario": "error_fast_path", "seen_statuses": []}

Baseline raised immediately on WS ERROR without REST polling.

Step 2 — Re-run with the PR:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_conversation_race.py error_fast_path:

{"elapsed_ticks_seconds": 0.15, "error": "ConversationRunError: Conversation run failed for id=f8bf31b9-f806-483c-b0b8-d1eb4ebe4c40: Remote conversation ended with error", "outcome": "raised", "poll_count": 0, "posts_to_run": 1, "scenario": "error_fast_path", "seen_statuses": []}

This confirms the PR preserves the stated immediate terminal handling for ERROR.

Issues Found

None.

This review was created by an AI agent (OpenHands) on behalf of the user.

…rtion (round 6) - Replace 'assert signal.from_post_run_snapshot' with explicit 'if not ... raise AssertionError(...)' (PRRT_kwDOPjFrIs6FUiPj) so the invariant guard survives python -O and optimised builds. - Strengthen test_remote_conversation_run_full_state_updates_cached_state assertions (PRRT_kwDOPjFrIs6FUiPz): add exact cost check (accumulated_cost == pytest.approx(1.25)) and key presence check ('test-llm' in usage_to_metrics) so the state-reconciliation path is properly guarded against regressions. Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot · 2026-05-28T07:33:28Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Review: Fix RemoteConversation FINISHED stop-hook race

Summary

This is a well-designed and carefully-implemented fix for a subtle but genuine race condition. The root cause is clear: the server-side LocalConversation.run loop releases its state lock at the end of each iteration, so a FINISHED status set by agent.step() is observable to WebSocket clients before stop hooks run in the next iteration. Stop hooks can flip FINISHED back to RUNNING, making the WS-delivered FINISHED advisory rather than authoritative.

Correctness

The fix is correct. The key insight — treating WebSocket FINISHED as a hint and waiting for the server's post-run full-state snapshot as the authoritative termination signal — properly closes the race without degrading the ERROR/STUCK fast path. The _IMMEDIATE_TERMINAL_STATUSES set and _RunCompletionSignal.from_post_run_snapshot flag make the two-track behavior explicit and type-safe.

The _run_armed guard is elegant: clearing before the POST and setting after it ensures the initial WS subscription snapshot (delivered during connect()) can never be mistaken for the post-run completion signal. The finally: self._run_armed.clear() cleanup means the guard is always restored regardless of how _wait_for_run_completion exits.

REST Hard Fallback

The 30-second time-based hard fallback (TERMINAL_HARD_FALLBACK_SECS) is a sensible safety valve for the case where the WebSocket drops after the server finishes but before the post-run full-state snapshot is delivered. Hardcoding 30 s is reasonable given the stated rationale (stop hooks are fast; 30 s exceeds any reasonable hook execution time).

One nuance: the terminal_first_seen_at timer resets on every REST poll exception (see inline comment). In a degenerate scenario where the server alternates between reporting terminal status and throwing HTTP errors, the 30 s threshold could keep sliding indefinitely with timeout=None. Noted inline for awareness — not a blocker.

Test Coverage

The regression suite is thorough:

test_remote_conversation_run_waits_for_post_run_full_state — verifies the client does not return on the first REST-visible FINISHED without a WS snapshot.
test_remote_conversation_run_stale_pre_run_snapshot_is_ignored — verifies _run_armed rejects pre-run WS snapshots.
test_remote_conversation_run_ws_error_still_terminates_immediately — confirms ERROR short-circuits without waiting for REST.
test_remote_conversation_run_rest_hard_fallback_when_ws_silent — exercises the 30 s fallback via time.monotonic patching.
test_remote_conversation_run_keeps_error_queued_after_running_poll — confirms ERROR is not discarded by a racing non-terminal REST poll.

The install_post_run_full_state helper is well-documented (the comment explaining why it fires from GET rather than POST — to avoid racing _run_armed.set() — is particularly valuable).

Non-Core Changes

The programbench additions across skill files, the workflow, and manage_evals.py are clean mechanical updates. The AGENTS.md <KNOWN_RACES_AND_GOTCHAS> section is high-quality institutional knowledge that will save the next developer who encounters anything adjacent to this race.

Verdict

✅ The fix is correct, the tests are solid, and the documentation is exemplary. No blocking concerns.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

all-hands-bot

✅ QA Report: PASS

RemoteConversation now waits through a simulated FINISHED→RUNNING stop-hook race and completes only after the authoritative post-run full-state snapshot.

Does this PR achieve its stated goal?

Yes. I reproduced the pre-fix behavior on origin/main: RemoteConversation.run(blocking=True) returned from a WebSocket execution_status=finished hint even though the simulator's next authoritative REST status was running, matching the stop-hook race described in the PR. On this PR, the same user-facing RemoteConversation run ignored that FINISHED hint, observed three running polls, then returned only after the post-run full-state snapshot; I also verified WebSocket ERROR still raises immediately without REST polling.

Phase	Result
Environment Setup	✅ `uv run` created project environments and imported the SDK successfully.
CI Status	✅ 31 checks successful; only this `qa-changes` check was still in progress; skipped cleanup/artifact checks noted.
Functional Verification	✅ Before/after RemoteConversation execution confirms the race is fixed and ERROR remains immediate.

Functional Verification

Test 1: FINISHED WebSocket hint during stop-hook revert

Step 1 — Reproduce baseline without the fix:
Ran from an origin/main worktree:

cd /tmp/sdk-qa-main && OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_finished_race.py --scenario race

Key output:

Run completed via WebSocket notification (status: finished, elapsed: 0.0s)
SCENARIO=race
OUTCOME=returned
ELAPSED_SECS=0.01
REST_STATUS_POLLS=1
RUNNING_REVERT_SEEN=True
POST_RUN_FULL_STATE_SENT=False
HTTP_SEQUENCE=...,POST /api/conversations/.../run,GET /api/conversations/...

This confirms the bug exists on main: the client returned from the WS finished hint while the simulator's REST state had already reverted to running, and before any post-run full-state snapshot was sent.

Step 2 — Apply the PR's changes:
Used the checked-out PR branch fix/remote-conversation-finished-hint at commit 7c51f00439bfea78797ee0025b55cb9653646fd4.

Step 3 — Re-run with the fix in place:

OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_finished_race.py --scenario race

Key output:

Run completed via post-run WebSocket state update (status: finished, elapsed: 0.2s)
SCENARIO=race
OUTCOME=returned
ELAPSED_SECS=0.22
REST_STATUS_POLLS=4
RUNNING_REVERT_SEEN=True
POST_RUN_FULL_STATE_SENT=True
HTTP_SEQUENCE=...,POST /api/conversations/.../run,GET /api/conversations/...,GET /api/conversations/...,GET /api/conversations/...,GET /api/conversations/...,GET /api/conversations/.../events/search

This shows the fix works: run() did not stop at the per-field FINISHED hint, survived the simulated stop-hook running reverts, and returned only after the authoritative post-run full-state snapshot.

Test 2: WebSocket ERROR remains immediate

Ran on the PR branch:

OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_finished_race.py --scenario ws-error

Key output:

SCENARIO=ws-error
OUTCOME=raised
ERROR=ConversationRunError: Conversation run failed for id=...: Remote conversation ended with error
ELAPSED_SECS=0.02
REST_STATUS_POLLS=0
POST_RUN_FULL_STATE_SENT=False

This confirms the PR preserves the stated fast path for ERROR: it raises immediately from the WebSocket update without waiting for REST confirmation.

Issues Found

None.

This QA review was created by an AI agent (OpenHands) on behalf of the user.

…rationale (round 7) Add comment at the exception handler explaining why terminal_first_seen_at is reset on any REST poll error (PRRT_kwDOPjFrIs6FUzqe): we cannot confirm the server is still in a terminal state after a failed poll, so we conservatively restart the hard-fallback timer to prefer a false-negative wait over a false-positive early return. Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot · 2026-05-28T07:47:18Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Taste Rating: 🟢 Good taste

This is a well-executed fix for a real and subtle race condition. The PR has been through multiple review iterations and the current HEAD (b5bec8c) is in clean shape.

Root cause & fix correctness

The race is real: the server-side LocalConversation.run loop releases its state lock at the end of each iteration, so a FINISHED status set by agent.step() is observable by WS clients before the next iteration's stop hooks run. A stop hook returning rc=2 flips the status back to RUNNING — but the old client already exited. The fix correctly identifies the authoritative signal: not the per-field WS FINISHED event, but the server's post-run full-state ConversationStateUpdateEvent published after conversation.run()/arun() exits and pending events flush.

Three-part termination design

_IMMEDIATE_TERMINAL_STATUSES (ERROR, STUCK) — per-field WS updates are still immediately terminal for these; stop hooks cannot revert them.
Per-field WS FINISHED — excluded from _IMMEDIATE_TERMINAL_STATUSES; treated as advisory only and never enqueued for the fast-path.
Post-run full-state WS snapshot — authoritative; gated by _run_armed (set after POST /run, cleared in finally) to exclude the initial subscription snapshot that arrives during connect().

This is exactly the right separation of concerns.

Invariant guard at line 1153

The final iteration's fix — replacing assert signal.from_post_run_snapshot with an explicit raise AssertionError(...) — is correct. The assert form is silenced by python -O and some packaging tools; raise AssertionError(...) is not. The guard is currently theoretically unreachable (ERROR/STUCK raise in _handle_conversation_status before reaching this line, so only from_post_run_snapshot=True signals arrive), but documents the invariant and will catch regressions if _IMMEDIATE_TERMINAL_STATUSES is extended with a status that doesn't raise.

Hard fallback

TERMINAL_HARD_FALLBACK_SECS = 30.0 is the right safety valve for WS delivery failures (dropped socket after run finishes on the server). The timer resets on poll errors, which correctly prefers false-negative waits over false-positive early returns. 30 s is a generous but reasonable bound — stop hooks are fast; a hook revert would flip REST back to RUNNING within milliseconds, well before the threshold.

Tests

The new test cases exercise real code paths and assert on outputs:

test_remote_conversation_run_ws_finished_is_only_a_hint_not_terminal — the core regression: WS-delivered FINISHED after /run trigger does not complete run().
test_remote_conversation_run_rest_finished_revert_waits_for_full_state — REST FINISHED→RUNNING revert does not prematurely return.
test_remote_conversation_run_stale_pre_run_snapshot_is_ignored — _run_armed guard works end-to-end.
test_remote_conversation_run_ws_error_still_terminates_immediately — ERROR via WS still fast-paths (poll_interval=10s would make a fall-through take 30+ s).
test_remote_conversation_run_rest_hard_fallback_when_ws_silent — hard fallback activates when WS snapshot is never delivered.
install_post_run_full_state helper fires the WS event from the GET side effect (after _run_armed.set()) rather than from the POST side effect, correctly avoiding the arming race.

Evidence

ProgramBench CI links are present and passing (parent eval job, downstream inference phase, downstream eval harness phase).

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM — Remote run-completion semantics are core infrastructure. The change is narrowly targeted, all previously raised concerns are resolved in the current HEAD, and regression test coverage is solid. The 30 s hard fallback prevents indefinite hangs if WS delivery fails. Recommend merging after final maintainer sign-off.

VERDICT:
✅ Worth merging: The fix is correct, the invariant guard is robust, tests are real, and the evidence is present. All concerns from previous review rounds are resolved.

KEY INSIGHT:
Per-field WS FINISHED is advisory; only the server's post-run full-state snapshot is authoritative, because stop hooks can revert FINISHED before the server's run loop exits.

Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing. See the customization docs for the required frontmatter format.

Re-request a review — the reviewer reads guidelines from the PR branch, so your changes take effect immediately.

Was this review helpful? React with 👍 or 👎 to give feedback.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

all-hands-bot

✅ QA Report: PASS

The RemoteConversation race fix was functionally verified with a real SDK client talking to a local HTTP/WebSocket server that reproduced the FINISHED stop-hook race.

Does this PR achieve its stated goal?

Yes. On origin/main, RemoteConversation.run(blocking=True) returned immediately after a per-field WebSocket FINISHED hint while REST still reported running, reproducing the premature-return bug. On PR commit b5bec8c7, the same scenario waited through the reverted running statuses and returned only after the post-run full-state WebSocket snapshot reported finished; a separate run also confirmed WebSocket ERROR still raises immediately.

Phase	Result
Environment Setup	✅ `uv run` created/synced the project environment and imported the SDK successfully.
CI Status	🟡 At check time: 31 successful, 5 skipped, and `qa-changes` in progress.
Functional Verification	✅ Reproduced the old race, verified the fixed behavior, and checked the ERROR fast path.

Functional Verification

Test 1: FINISHED WebSocket hint vs stop-hook revert race

Step 1 — Reproduce / establish baseline (without the fix):
Ran git checkout --detach origin/main && OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_race.py:

{
  "log": "Run completed via WebSocket notification (status: finished, elapsed: 0.0s)",
  "run_returned": true,
  "elapsed_seconds": 0.003,
  "run_get_count": 1,
  "statuses_returned": ["running"],
  "final_cached_status": "running"
}

This confirms the bug exists: the old client returned from run() on the per-field WebSocket finished hint even though the server's authoritative REST state had already reverted to running.

Step 2 — Apply the PR's changes:
Checked out PR commit b5bec8c723e0415a3b738ef91faea7f0e2842458.

Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_race.py:

{
  "log": "Run completed via post-run WebSocket state update (status: finished, elapsed: 0.2s)",
  "run_returned": true,
  "elapsed_seconds": 0.211,
  "run_get_count": 4,
  "statuses_returned": ["running", "running", "running", "finished"],
  "final_cached_status": "finished"
}

This confirms the fix works: the PR client ignored the unsafe per-field FINISHED hint, observed the server remain running for three polls, and returned only after the post-run full-state snapshot.

Test 2: WebSocket ERROR remains immediate

Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_error.py on the PR commit:

{
  "raised": "ConversationRunError",
  "elapsed_seconds": 0.004,
  "run_get_count": 0,
  "message_contains_error": true
}

This confirms the PR preserves the intended immediate ERROR behavior: run() raised before any REST polling despite a 10-second poll interval.

Issues Found

None.

This QA review was generated by an AI agent (OpenHands) on behalf of the user.

Co-authored-by: openhands <openhands@all-hands.dev>

neubig · 2026-05-28T21:30:55Z

Pushed a minimal simplification in 6808b8a: RemoteConversation now treats only the post-run full-state WebSocket snapshot as authoritative completion and lets REST polling handle error/stuck fallback. Ready for another automated look.\n\nThis comment was generated by an AI agent (OpenHands) on behalf of neubig.

all-hands-bot · 2026-05-28T21:31:44Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

⚠️ QA Report: PASS WITH ISSUES

The FINISHED stop-hook race is fixed in the exercised RemoteConversation path, but QA found a related regression in immediate WebSocket ERROR handling.

Does this PR achieve its stated goal?

Yes for the primary goal. Using a short SDK harness that calls RemoteConversation.run() against a fake remote server, the base branch returned after a per-field WebSocket FINISHED while the server was still reporting RUNNING; the PR ignored that hint and waited until the post-run full-state snapshot arrived. However, the same exercise shows per-field WebSocket ERROR is no longer surfaced immediately, despite the PR description saying ERROR/STUCK remain immediate terminal states.

Phase	Result
Environment Setup	✅ `make build` completed and installed the uv development environment
CI Status	🟡 Most checks passing; `qa-changes` and several image build jobs were still pending when checked
Functional Verification	⚠️ Primary FINISHED race fixed; per-field ERROR immediacy regressed

Functional Verification

Test 1: FINISHED stop-hook race behavior

Step 1 — Reproduce / establish baseline (without the fix):
Ran git checkout --quiet origin/main && OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_conversation_race.py finished-race:

5e614afe
Run completed via WebSocket notification (status: finished, elapsed: 0.0s)
finished_race_rest_polls=1

This confirms the bug exists: the client accepted the per-field WebSocket FINISHED and returned while the fake server was still in the first REST RUNNING poll that represents the stop-hook veto window.

Step 2 — Apply the PR's changes:
Checked out PR commit 6808b8a8a4958ec8d762243f5152d125265500b9.

Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_conversation_race.py finished-race:

6808b8a8
REST has reported terminal status finished for 3 polls without a post-run WebSocket snapshot; continuing to wait for the authoritative snapshot (elapsed: 0.1s)
Run completed via post-run WebSocket state update (status: finished, elapsed: 0.1s)
finished_race_rest_polls=6

This shows the fix works for the PR's main goal: run() ignored the early per-field FINISHED, continued through the simulated RUNNING revert, and returned only after the authoritative full-state snapshot.

Test 2: ERROR remains immediate terminal behavior

Step 1 — Reproduce / establish baseline (without the fix):
Ran git checkout --quiet origin/main && OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_conversation_race.py per-field-error:

per_field_error_result=ConversationRunError after 0.00s; REST polls=0; message=Conversation run failed for id=2d30afb2-ff95-46e8-a902-a14442d222d9: Remote conversation ended with error

This shows the prior client behavior: a per-field WebSocket ERROR was surfaced immediately without waiting for REST polling.

Step 2 — Apply the PR's changes:
Checked out PR commit 6808b8a8a4958ec8d762243f5152d125265500b9.

Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_conversation_race.py per-field-error:

per_field_error_result=ConversationRunError after 0.26s; REST polls=5; message=Conversation run failed for id=b835439a-8ef7-4acf-954e-c109e1980e76: Run timed out after 0.25 seconds. The conversation may still be running on the server.

This shows a regression in the stated non-FINISHED behavior: the WebSocket ERROR was ignored and the caller only got a timeout while REST kept reporting RUNNING.

Issues Found

🟠 Issue: Per-field WebSocket ERROR/STUCK updates no longer appear to be immediate terminal signals; a user can now wait until polling/timeout if REST remains stale.

This review was created by an AI agent (OpenHands) on behalf of the user.

all-hands-bot

Taste Rating: 🟢 Good taste

This is a well-executed fix for a real, production-observed race condition. The root cause analysis is accurate: the server-side LocalConversation.run loop releases its state lock at iteration end, so a per-field FINISHED WS event can arrive before stop hooks run in the next iteration. The architectural fix — treating per-field status updates as advisory and requiring the post-run full-state WS snapshot as the authoritative completion signal — is the correct approach.

The _run_armed (threading.Event) mechanism elegantly prevents the initial WS subscription snapshot from being mistaken for a post-run completion signal by gating the callback before the POST /run call. The 30-second time-based TERMINAL_HARD_FALLBACK_SECS is a well-reasoned safety valve: stop hooks complete in seconds, so 30 s provides a strong signal that the WS snapshot was lost without introducing a painful hang. Tests are comprehensive, exercise real code paths, and include a patch of time.monotonic to deterministically trigger the hard-fallback path — well above the bar for regression coverage.

[IMPROVEMENT OPPORTUNITIES]

AGENTS.md note references non-existent symbols

The newly added <KNOWN_RACES_AND_GOTCHAS> note says "the REST poll path with TERMINAL_POLL_THRESHOLD consecutive terminal polls" and "See _immediate_terminal_statuses". Neither symbol exists in the current implementation:

The count-based TERMINAL_POLL_THRESHOLD was replaced by the time-based TERMINAL_HARD_FALLBACK_SECS = 30.0.
_immediate_terminal_statuses is not a method on RemoteConversation; ERROR/STUCK are handled inline inside _handle_conversation_status.

A future contributor reading AGENTS.md and then grepping the codebase for these symbols will find nothing. Suggest updating the sentence to reference TERMINAL_HARD_FALLBACK_SECS and _handle_conversation_status.

_handle_conversation_status return value is never consumed

The method signature is -> bool and its docstring says "return True if the run is complete." However, both call sites (WS path and REST path in _wait_for_run_completion) discard the return value. Callers rely entirely on the exception-raise side effect for ERROR/STUCK. Either drop the return type to None, or add a callsite that consumes it. As written it creates a false impression that the bool controls downstream logic.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟢 LOW — the change tightens run-completion semantics and is covered by 32 passing tests plus end-to-end ProgramBench CI evidence cited in the PR description. The eval-infrastructure changes (adding programbench to benchmark lists) are purely additive. The only rollback risk is if a deployment environment consistently drops the post-run WS full-state snapshot, in which case the 30 s hard fallback adds latency; that scenario is preferable to the previous premature-termination bug.

VERDICT:
✅ Worth merging: Core logic is sound, tests are solid, real production evidence provided. The AGENTS.md stale references and unused return type are polish items, not blockers.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better by adding a .agents/skills/custom-codereview-guide.md file to your branch with the /codereview trigger and context the reviewer is missing.

Was this review helpful? React with 👍 or 👎 to give feedback.

Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot · 2026-05-28T21:45:35Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

✅ QA Report: PASS

Verified the RemoteConversation stop-hook race behavior with a functional SDK harness: the PR no longer returns on an early WebSocket FINISHED hint and still raises immediately on WebSocket ERROR.

Does this PR achieve its stated goal?

Yes. The stated goal is to prevent RemoteConversation.run(blocking=True) from treating a per-field WebSocket FINISHED update as authoritative when a server-side stop hook may revert the run to RUNNING. In the baseline run, the client returned immediately after the early WS FINISHED and before the simulated post-run full-state snapshot (delivered_full_state: false). With the PR applied, the same user-facing RemoteConversation.run(...) call stayed active through the simulated stop-hook veto window and returned only after the authoritative full-state snapshot (poll_count: 5, delivered_full_state: true).

Phase	Result
Environment Setup	✅ `make build` completed successfully with `uv sync --dev`
CI Status	⏳ 19 successful, 8 pending, 3 skipped, 0 failing at last check
Functional Verification	✅ RemoteConversation race fix and immediate WS error path exercised

Functional Verification

Test 1: `RemoteConversation.run()` must ignore early WS `FINISHED` hints until post-run full state

Step 1 — Reproduce / establish baseline without the fix:
Ran git checkout --detach origin/main && OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_finished_race.py.

The harness uses the public SDK RemoteConversation.run(blocking=True) against a local HTTP remote-conversation harness that behaves like the problematic server timeline: POST /run emits a per-field WS execution_status=finished, then REST reports running for the stop-hook veto window, then a post-run full-state snapshot is delivered later.

Observed baseline excerpt:

{
  "outcome": "returned",
  "error": null,
  "poll_count": 1,
  "delivered_full_state": false,
  "elapsed_seconds": 0.003
}

Log excerpt:

Run completed via WebSocket notification (status: finished, elapsed: 0.0s)

This confirms the bug: the client returned on the early WS FINISHED before the simulated server completed stop-hook evaluation and before the authoritative full-state snapshot arrived.

Step 2 — Apply the PR's changes:
Checked out fix/remote-conversation-finished-hint at 384877e691edde70dcb8b2b40afa08e0cdedc8be.

Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_finished_race.py.

Observed PR excerpt:

{
  "outcome": "returned",
  "error": null,
  "poll_count": 5,
  "delivered_full_state": true,
  "elapsed_seconds": 0.26
}

Log excerpt:

Run completed via post-run WebSocket state update (status: finished, elapsed: 0.3s)

This confirms the fix works: the client ignored the early per-field FINISHED, continued polling through the simulated RUNNING stop-hook veto window, and returned only after the full-state WebSocket update.

Test 2: WebSocket `ERROR` remains immediate

Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_remote_ws_error.py on the PR branch. The harness emits a WS execution_status=error immediately after POST /run while REST would otherwise report running if polled.

Observed excerpt:

{
  "outcome": "raised",
  "error": "ConversationRunError: Conversation run failed for id=...: qa injected error",
  "poll_count": 0,
  "elapsed_seconds": 0.004
}

This confirms the PR preserves the intended immediate terminal handling for error states; it raised without waiting for REST polling.

Issues Found

None.

This review was created by an AI agent (OpenHands) on behalf of the user.

all-hands-bot

Solid fix for a subtle but real race. The core design — treating per-field WS FINISHED as advisory and only accepting full-state snapshots as authoritative completion signals — is the right call, and the _run_armed gate elegantly prevents the initial subscription snapshot from masquerading as a post-run signal. Test coverage is thorough, including the regression scenario, ERROR short-circuit, hard-fallback path, and stale-snapshot guard.

Two minor observations below; neither is a blocker.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: Graham Neubig <398875+neubig@users.noreply.github.com> Co-authored-by: OpenHands Bot <contact@all-hands.dev>

openhands-agent added 2 commits May 10, 2026 15:55

docs(AGENTS): document RemoteConversation WS-vs-stop-hook race

bd2fd08

Persistent note for future sessions / agents debugging this surface so they don't have to re-derive the diagnosis from output.jsonl artifacts. Co-authored-by: openhands <openhands@all-hands.dev>

neubig mentioned this pull request May 10, 2026

Add programbench to run-eval benchmark choices #3075

Merged

neubig requested a review from all-hands-bot May 10, 2026 15:58

all-hands-bot reviewed May 10, 2026

View reviewed changes

Add programbench to run-eval benchmark choices (#3075)

1aa4176

Co-authored-by: openhands <openhands@all-hands.dev>

neubig mentioned this pull request May 18, 2026

Fix RemoteConversation FINISHED stop-hook race #3294

Closed

merge main into PR #3191

f8ba78a

Co-authored-by: openhands <openhands@all-hands.dev>

neubig added the review-this This label triggers a PR review by OpenHands label May 24, 2026 — with OpenHands AI

all-hands-bot reviewed May 24, 2026

View reviewed changes

Comment thread tests/sdk/conversation/remote/test_remote_conversation.py Outdated

test: cover post-run websocket FINISHED race (#3191)

f3ab1a5

Co-authored-by: openhands <openhands@all-hands.dev>

neubig removed the review-this This label triggers a PR review by OpenHands label May 24, 2026

neubig added the review-this This label triggers a PR review by OpenHands label May 24, 2026 — with OpenHands AI

all-hands-bot reviewed May 24, 2026

View reviewed changes

Comment thread openhands-sdk/openhands/sdk/conversation/impl/remote_conversation.py Outdated

Comment thread tests/sdk/conversation/remote/test_remote_conversation.py

fix: require final remote completion confirmation (#3191)

4448c86

Co-authored-by: openhands <openhands@all-hands.dev>

neubig removed the review-this This label triggers a PR review by OpenHands label May 24, 2026

neubig added the review-this This label triggers a PR review by OpenHands label May 24, 2026 — with OpenHands AI

all-hands-bot reviewed May 24, 2026

View reviewed changes

Comment thread openhands-sdk/openhands/sdk/conversation/impl/remote_conversation.py Outdated

fix: preserve remote error status during hint cleanup (#3191)

e62448c

Co-authored-by: openhands <openhands@all-hands.dev>

neubig removed the review-this This label triggers a PR review by OpenHands label May 24, 2026

neubig added the review-this This label triggers a PR review by OpenHands label May 24, 2026 — with OpenHands AI

fix(remote-conv): fix E501 in _RunCompletionSignal comment (ruff)

932a823

Shorten line 69 comment that triggered ruff E501 (92 > 88 chars) introduced in the round-5 commit. Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot self-requested a review May 28, 2026 07:16

all-hands-bot reviewed May 28, 2026

View reviewed changes

Comment thread openhands-sdk/openhands/sdk/conversation/impl/remote_conversation.py Outdated

Comment thread tests/sdk/conversation/remote/test_remote_conversation.py Outdated

all-hands-bot reviewed May 28, 2026

View reviewed changes

all-hands-bot self-requested a review May 28, 2026 07:32

all-hands-bot reviewed May 28, 2026

View reviewed changes

Comment thread openhands-sdk/openhands/sdk/conversation/impl/remote_conversation.py

all-hands-bot reviewed May 28, 2026

View reviewed changes

all-hands-bot self-requested a review May 28, 2026 07:46

all-hands-bot reviewed May 28, 2026

View reviewed changes

fix(sdk): simplify remote run completion signaling

6808b8a

Co-authored-by: openhands <openhands@all-hands.dev>

neubig requested a review from all-hands-bot May 28, 2026 21:30

all-hands-bot reviewed May 28, 2026

View reviewed changes

Comment thread openhands-sdk/openhands/sdk/conversation/impl/remote_conversation.py

all-hands-bot approved these changes May 28, 2026

View reviewed changes

Comment thread AGENTS.md Outdated

fix(sdk): keep websocket error fast path

384877e

Co-authored-by: openhands <openhands@all-hands.dev>

neubig requested a review from all-hands-bot May 28, 2026 21:44

all-hands-bot reviewed May 28, 2026

View reviewed changes

Comment thread openhands-sdk/openhands/sdk/conversation/impl/remote_conversation.py

Comment thread openhands-sdk/openhands/sdk/conversation/impl/remote_conversation.py

neubig merged commit 9d65e9f into main May 29, 2026
45 of 46 checks passed

neubig deleted the fix/remote-conversation-finished-hint branch May 29, 2026 04:44

Conversation

neubig commented May 10, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

Changes

Verification

Stacking

Issue

Uh oh!

github-actions Bot commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Python API breakage checks — ✅ PASSED

Uh oh!

github-actions Bot commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

REST API breakage checks (OpenAPI) — ✅ PASSED

Uh oh!

github-actions Bot commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

✅ QA Report: PASS

Does this PR achieve its stated goal?

Verification Approach

Test 1: FINISHED is a hint, not authoritative termination

Test 2: ERROR and STUCK remain immediate terminal states

Test 3: REST polling confirmation pattern

Test 4: Regression test coverage

CI Verification

Issues Found

Uh oh!

all-hands-bot commented May 23, 2026

Uh oh!

neubig commented May 24, 2026

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

neubig commented May 24, 2026

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

neubig commented May 24, 2026

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

neubig commented May 24, 2026

Uh oh!

all-hands-bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

✅ QA Report: PASS

Does this PR achieve its stated goal?

Test 1: FINISHED stop-hook race

Test 2: ERROR remains an immediate terminal signal

Issues Found

Uh oh!

all-hands-bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

neubig commented May 10, 2026 •

edited by github-actions Bot

Loading

github-actions Bot commented May 10, 2026 •

edited

Loading

github-actions Bot commented May 10, 2026 •

edited

Loading

github-actions Bot commented May 10, 2026 •

edited

Loading

all-hands-bot commented May 28, 2026 •

edited

Loading

all-hands-bot commented May 28, 2026 •

edited

Loading

all-hands-bot commented May 28, 2026 •

edited

Loading

all-hands-bot commented May 28, 2026 •

edited

Loading

all-hands-bot commented May 28, 2026 •

edited

Loading